Remove field id constraint on add files #2662
pyiceberg/io/pyarrow.py
Outdated
requested_id_to_name = requested_schema._lazy_id_to_name
provided_id_to_name = provided_schema._lazy_id_to_name
Hey @jeroko, thanks for working on this and adding this check.
That said, I don't think we really care about the names; it is not a problem when they differ. If you add a file with a different schema, though, we can brick the table because of type issues. Should we instead check that the file contains the expected type for each of the IDs?
@Fokko Right, we should not care about the names if the IDs are provided, and the mapping between the IDs and the types is already checked by the call to `_check_schema_compatible` at the end of this function. So I didn't really need to add any extra check, just a new test to verify that files with matching field IDs and incompatible types fail.
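The scenario such a test covers can be sketched without PyIceberg at all. The snippet below is only an illustration under simplifying assumptions: `check_types_by_id` and the dict-based schemas (field ID to type name) are hypothetical stand-ins, not the project's `_check_schema_compatible`, and type promotions are ignored.

```python
from typing import Dict


def check_types_by_id(table_types: Dict[int, str], file_types: Dict[int, str]) -> None:
    """Raise ValueError if a file field's type differs from the table field with the same ID.

    Hypothetical helper: schemas are modelled as {field_id: type_name} and
    column names are deliberately not part of the comparison.
    """
    for field_id, file_type in file_types.items():
        expected = table_types.get(field_id)
        if expected is not None and expected != file_type:
            raise ValueError(
                f"Field ID {field_id}: file has type {file_type!r}, table expects {expected!r}"
            )


table = {1: "long", 2: "string"}

# Same IDs and same types: accepted, whatever the column names are.
check_types_by_id(table, {1: "long", 2: "string"})

# Same IDs, but field 2 is a double in the file: rejected.
try:
    check_types_by_id(table, {1: "long", 2: "double"})
except ValueError as exc:
    print(exc)
```

Keying the comparison on field IDs rather than names reflects the point made in this thread: differing names are harmless, differing types are not.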
Thanks for the PR! This is a great addition. Added a few comments
I think we should at least check that the Parquet field IDs align with the Iceberg field IDs.
`add_files` can work with Parquet files both with and without field IDs in their metadata:
- **Files with field IDs**: When field IDs are present in the Parquet metadata, they must match the corresponding field IDs in the Iceberg table schema. This is common for files generated by tools like Spark or by other libraries that write explicit field ID metadata.
- **Files without field IDs**: When field IDs are absent, the table must have a [Name Mapping](https://iceberg.apache.org/spec/?h=name+mapping#name-mapping-serialization) to map field names to Iceberg field IDs. `add_files` will automatically create a Name Mapping based on the table's current schema if one doesn't already exist.
In both cases, a Name Mapping is created if the table doesn't have one, ensuring compatibility with various readers.
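The two cases in this docs snippet can be pictured with a small, library-free sketch. Everything here is an assumption made for illustration: `build_name_mapping`, `resolve_field_ids`, and the flat `{field_id: name}` schema are hypothetical and do not mirror PyIceberg's internals or its nested Name Mapping format.

```python
from typing import Dict, List, Optional, Tuple

# A file column is (name, field_id); field_id is None when the Parquet
# metadata carries no field ID for that column.
FileColumn = Tuple[str, Optional[int]]


def build_name_mapping(table_schema: Dict[int, str]) -> Dict[str, int]:
    """Derive a name -> field ID mapping from the table's current schema."""
    return {name: field_id for field_id, name in table_schema.items()}


def resolve_field_ids(table_schema: Dict[int, str], columns: List[FileColumn]) -> Dict[str, int]:
    """Return the Iceberg field ID for every file column, or raise ValueError."""
    name_mapping = build_name_mapping(table_schema)
    resolved: Dict[str, int] = {}
    for name, field_id in columns:
        if field_id is None:
            # No field ID in the file: fall back to the name mapping.
            if name not in name_mapping:
                raise ValueError(f"Column {name!r} is not in the name mapping")
            field_id = name_mapping[name]
        elif field_id not in table_schema:
            # Field ID present in the file but unknown to the table.
            raise ValueError(f"Field ID {field_id} is not in the table schema")
        resolved[name] = field_id
    return resolved


table_schema = {1: "id", 2: "name"}

# File with field IDs: matched by ID, so a renamed column is fine.
print(resolve_field_ids(table_schema, [("id_renamed", 1), ("name", 2)]))

# File without field IDs: matched by name through the mapping.
print(resolve_field_ids(table_schema, [("id", None), ("name", None)]))
```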
For Parquet files with field IDs, I don't think we necessarily need the name mapping if the file's IDs are aligned with the table schema field IDs.
But we can address this separately.
Rationale for this change
Closes #2131
The PR relaxes the constraint that prevented adding any file with field IDs and replaces it with one that rejects only files whose field IDs are inconsistent with the field IDs of the table. Files with compatible field IDs can be added safely; files with incompatible ones are rejected.
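As a rough before/after picture of the rule described here, the two predicates below contrast the old and new behaviour. They are a hypothetical sketch, not the code in `pyiceberg/io/pyarrow.py`: "consistent" is reduced to "every field ID in the file exists in the table schema", and `None` stands for a column without a field ID.

```python
from typing import List, Optional, Set


def old_constraint(file_field_ids: List[Optional[int]]) -> bool:
    """Previous rule as described above: accept only files that carry no field IDs."""
    return all(field_id is None for field_id in file_field_ids)


def new_constraint(file_field_ids: List[Optional[int]], table_field_ids: Set[int]) -> bool:
    """Relaxed rule (simplified): accept field IDs that all exist in the table schema."""
    return all(
        field_id is None or field_id in table_field_ids
        for field_id in file_field_ids
    )


table_ids = {1, 2, 3}
print(old_constraint([1, 2]))             # file with IDs, old rule
print(new_constraint([1, 2], table_ids))  # same file, relaxed rule
print(new_constraint([1, 99], table_ids)) # ID 99 is unknown to the table
```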
Are these changes tested?
Yes
Are there any user-facing changes?
Yes